Name: Daniel Kojo Afealete Fiadjoe

Student ID: 202291439

Course: DSCI-6607 (Programmatic Data Analysis Using R and Python)

Project work 2

INTRODUCTION

The purpose of this project is to perform a comprehensive analysis of the provided Hepatitis data set. The work involves cleaning the data, applying appropriate statistical methods, and carrying out a complete analysis. It also includes correct and adequate interpretation and discussion of the data, graphs, tables, and results.

The following areas were covered during the project work:

  1. Importing the Hepatitis data into Jupyter Notebook. This platform was used throughout the project.
  2. Review of data in the text file.
  3. Cleaning of the data.
  4. Exploration of the data through visualization, and
  5. Drawing insights from the data.

Relevant Information:

Loading of data and preprocessing

Observations:

  1. Number of columns with object data type: 16
  2. Number of columns with int64 data type: 3
  3. Number of columns with float64 data type: 1
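
These counts are the kind reported by pandas' dtypes; a minimal sketch on a toy frame (the column names are illustrative, and the '?' placeholder is what forces otherwise numeric columns to load as object):

```python
import pandas as pd

# Toy frame standing in for the raw Hepatitis data: a numeric column
# containing "?" placeholders loads as object instead of a numeric dtype.
df = pd.DataFrame({
    "Age": [30, 50, 78],            # int64
    "Sex": [1, 2, 1],               # int64
    "Class": [2, 1, 2],             # int64
    "Albumin": [3.5, 4.0, 2.8],     # float64
    "Bilirubin": [0.9, "?", 1.2],   # object, because of the "?" entry
})

print(df.dtypes.value_counts())
```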

Checking for missing data

Analysis so far shows evidence of additional missing data represented by "?" and ".". The data also contains misleading sentinel values such as '9999' and '99999'. These need to be identified and treated. In order to identify the missing data, I will reload the data using the na_values argument.
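
As a sketch of that reload (the file contents and column subset here are illustrative, not the actual project file), pandas' na_values argument can flag these placeholders at import time:

```python
import io
import pandas as pd

# Stand-in for the raw file: "?" and "." mark missing values, and the
# misleading sentinel codes 9999/99999 should also be treated as missing.
raw = "Age,Bilirubin,Protime\n30,0.9,?\n50,.,80\n78,9999,99999\n"

# na_values converts every listed placeholder to NaN during parsing.
df = pd.read_csv(io.StringIO(raw), na_values=["?", ".", 9999, 99999])

print(df.isna().sum().sum())   # total missing values after flagging
print(df.dtypes.tolist())      # affected columns now parse as float64
```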

Observations:

  1. Number of columns with float64 data type: 17
  2. Number of columns with int64 data type: 3

Observation:

- The majority of the missing data comes from the variable 'Protime', with 43% missing. This is followed by 'Alk_Phosphate' with 23% missing.

Introducing the na_values argument when importing the data has increased the number of detected missing values from 10 to 193. The affected columns have also been converted from object to float.

Checking for and imputing missing data

Treating missing values is an important step in cleaning the data and making it ready for further analysis and usage. It is thus important to identify them correctly. Having an idea of how these values are distributed can give us direction in treating them. In this data, we will impute the missing values using backward fill (bfill).
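
A minimal sketch of backward-fill imputation on a toy column (pandas' bfill; note that a trailing missing value would remain unfilled and need another strategy):

```python
import numpy as np
import pandas as pd

# Toy column standing in for one Hepatitis variable with gaps; in the
# report, backward fill is applied to the whole frame.
s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0])

# Each NaN takes the next valid value below it.
filled = s.bfill()
print(filled.tolist())   # [2.0, 2.0, 5.0, 5.0, 5.0]
```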

The information displayed above shows that the missing data has been resolved completely without deleting any rows.

Observations:

  1. There could be possible outliers in the variables Bilirubin, Sgot, Alk_Phosphate, and Protime.
  2. Age ranges from 7 to 78 years.
  3. The mean age is 41 years and the median is 39, which are close to each other.
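
These summary figures are the kind produced by pandas' describe(); a minimal sketch on an illustrative Age column (the values are toy data, not the actual records):

```python
import pandas as pd

# Toy Age column; the real data ranges from 7 to 78 with mean near 41.
age = pd.Series([7, 30, 39, 41, 50, 78])

stats = age.describe()
print(stats["min"], stats["max"])   # range of ages
print(age.mean(), age.median())     # mean vs median closeness
```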

Exploration of the dataset with observations

This section looks at the univariate, bivariate and multivariate analysis of the data. It involves plotting the data and interpreting the plots. Outliers will also be checked and remediated if necessary.

Univariate Analysis

The project data did not indicate whether "LIVE" and "DIE" are coded as 1 or 2. As a result, 1 will represent "DIE" and 2 will represent "LIVE". The target variable shows 79% of patients alive and 21% dying.
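
Under that assumed coding, the class shares can be computed with value_counts; a toy sketch with 100 illustrative records:

```python
import pandas as pd

# Toy target column using the assumed coding: 1 = DIE, 2 = LIVE.
target = pd.Series([2] * 79 + [1] * 21)

# normalize=True returns proportions instead of raw counts.
share = target.value_counts(normalize=True)
print(share[2], share[1])   # proportions of LIVE and DIE
```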

Identification of outliers and their removal

An outlier is a data point that differs significantly from other observations in a dataset. The univariate analysis above revealed a number of outliers in some of the variables in the data set. The following variables contain outliers:

  1. Age
  2. Bilirubin
  3. Alk_Phosphate
  4. Sgot
  5. Albumin
  6. Protime

Using Interquartile Range (IQR) technique

This technique uses the calculated IQR to remove outliers. The rule of thumb is that any value outside the range (Q1 - 1.5 IQR) to (Q3 + 1.5 IQR) is an outlier and can be removed.
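
A minimal sketch of the IQR rule on a toy column (the data and cut-offs are illustrative):

```python
import pandas as pd

# Toy column with one obvious outlier (9.0).
s = pd.Series([1.0, 1.2, 0.9, 1.1, 1.0, 9.0])

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = s[(s >= lower) & (s <= upper)]
print(cleaned.tolist())
```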

From the diagram above, all the outliers in the data have been removed.

Bivariate and Multivariate Analysis

This section plots pairs of variables to compare their relationships.

Correlation Visualisation after data pre-processing

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). From the correlation diagram above, my observations were:

  1. Overall, the correlations between the variables are weak.
  2. None of the correlations was above 0.5 (50%).
  3. The highest correlation in the data was between Albumin and Ascites.
  4. In order to build a model with this data, a number of variables with less significant correlation would need to be dropped.
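
As a sketch of how the strongest pair can be read off the correlation matrix (toy data with illustrative values only; the report's heatmap would come from the full cleaned frame, e.g. with seaborn's heatmap):

```python
import pandas as pd

# Toy numeric frame standing in for the cleaned Hepatitis data.
df = pd.DataFrame({
    "Albumin": [4.0, 3.5, 2.8, 4.2, 3.0],
    "Ascites": [2, 2, 1, 2, 1],
    "Age": [52, 30, 47, 61, 25],
})

corr = df.corr()

# Mask the diagonal (self-correlation of 1.0), then find the strongest
# remaining off-diagonal pair by absolute value.
pair = corr.where(corr < 1).abs().stack().idxmax()
print(pair)
```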

Making Pairplots

Observation: The diagram above confirms the assertion that the correlations between the variables are very low. Some of the pairwise relationships between the variables are almost non-existent.
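
A pairplot of this kind can be sketched with pandas' scatter_matrix (seaborn's pairplot serves the same purpose); toy data, rendered off-screen:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed

import pandas as pd
from pandas.plotting import scatter_matrix

# Toy numeric frame standing in for the cleaned Hepatitis data.
df = pd.DataFrame({
    "Age": [30, 47, 61, 25, 52],
    "Albumin": [4.0, 3.5, 2.8, 4.2, 3.0],
    "Bilirubin": [0.9, 1.2, 3.5, 0.7, 1.8],
})

# One panel per variable pair, histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6))
print(axes.shape)
```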

CONCLUSION

This project has been an extensive exercise in the analysis and visualization of the Hepatitis data provided. The following key observations were made during the data analysis:

  1. Most of the time was spent on cleaning and visualizing the data.
  2. The identified missing data were successfully treated using the backfill method in Python.
  3. Identified outliers were successfully treated or removed.
  4. The correlations between the variables were weak. In order to build a model with this data, a number of variables with less significant correlation would need to be dropped.
  5. For most of the variables, the mean and median were close to each other.
  6. After this extensive cleaning and analysis, we can affirm that the Hepatitis data is relevant and can be used for further research work and for statistical or machine learning model building.